27 research outputs found
A logic for document spanners
Document spanners are a formal framework for information extraction that was introduced by Fagin, Kimelfeld, Reiss, and Vansummeren (PODS 2013, JACM 2015). One of the central models in this framework is that of core spanners, which formalize the query language AQL that is used in IBM’s SystemT. As shown by Freydenberger and Holldack (ICDT 2016, ToCS 2018), there is a connection between core spanners and ECreg, the existential theory of concatenation with regular constraints. The present paper further develops this connection by defining SpLog, a fragment of ECreg that has the same expressive power as core spanners. This equivalence extends beyond equivalence of expressive power, as we show the existence of polynomial time conversions between SpLog and core spanners. Consequences and applications include an alternative way of defining relations for spanners, a pumping lemma for core spanners, and insights into the relative succinctness of various classes of spanner representations and their connection to graph querying languages. We also briefly discuss the connection between SpLog with negation and core spanners with a difference operator.
A logic for document spanners
Document spanners are a formal framework for information extraction that was introduced by Fagin, Kimelfeld, Reiss, and Vansummeren (PODS 2013, JACM 2015). One of the central models in this framework is that of core spanners, which are based on regular expressions with variables that are then extended with an algebra. As shown by Freydenberger and Holldack (ICDT 2016), there is a connection between core spanners and ECreg, the existential theory of concatenation with regular constraints. The present paper further develops this connection by defining SpLog, a fragment of ECreg that has the same expressive power as core spanners. This equivalence extends beyond equivalence of expressive power, as we show the existence of polynomial time conversions between this fragment and core spanners. This even holds for variants of core spanners that are based on automata instead of regular expressions. Applications of this approach include an alternative way of defining relations for spanners, insights into the relative succinctness of various classes of spanner representations, and a pumping lemma for core spanners.
Extended regular expressions: succinctness and decidability
Most modern implementations of regular expression engines allow the use of variables (also called backreferences). The resulting extended regular expressions (which, in the literature, are also called practical regular expressions, rewbr, or regex) are able to express non-regular languages. The present paper demonstrates that extended regular expressions cannot be minimized effectively (neither with respect to length nor with respect to the number of variables), and that the tradeoff in size between extended and "classical" regular expressions is not bounded by any recursive function. In addition, we prove the undecidability of several decision problems (universality, regularity, and cofiniteness) for extended regular expressions. Furthermore, we show that all these results hold even if the extended regular expressions contain only a single variable. © 2012 Springer Science+Business Media, LLC
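The claim that a single backreference already leaves the regular languages can be illustrated with a minimal sketch in Python's `re` module. The names `SQUARE` and `is_square` are chosen here for illustration and do not come from the paper. The expression `(.+)\1` matches exactly the words of the form ww with w nonempty, a standard example of a non-regular language over any alphabet with at least two letters.

```python
import re

# Extended regular expression with one variable (capture group 1) and one
# backreference to it. re.DOTALL lets "." match any character, newlines too.
SQUARE = re.compile(r"(.+)\1", re.DOTALL)


def is_square(word: str) -> bool:
    """Return True if word = ww for some nonempty string w."""
    return SQUARE.fullmatch(word) is not None
```

For instance, `is_square("abab")` holds because the group can capture `ab`, whereas no choice of captured string makes `aba` a square.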
The unambiguity of segmented morphisms
The unambiguity of segmented morphisms
Inferring descriptive generalisations of formal languages
In the present paper, we introduce a variant of Gold-style learners that is not required to infer precise
descriptions of the languages in a class, but that must find descriptive patterns, i.e., optimal
generalisations within a class of pattern languages. Our first main result characterises those indexed
families of recursive languages that can be inferred by such learners, and we demonstrate that this
characterisation shows enlightening connections to Angluin’s corresponding result for exact inference.
Using a notion of descriptiveness that is restricted to the natural subclass of terminal-free
E-pattern languages, we introduce a generic inference strategy, and our second main result characterises
those classes of languages that can be generalised by this strategy. This characterisation
demonstrates that there are major classes of languages that can be generalised in our model, but cannot
be inferred by a normal Gold-style learner. Our corresponding technical considerations lead to deep
insights of intrinsic interest into combinatorial and algorithmic properties of pattern languages.
The unambiguity of segmented morphisms
This paper studies the ambiguity of morphisms in free monoids. A morphism
σ is said to be ambiguous with respect to a string α if there exists a morphism
τ which differs from σ for a symbol occurring in α, but nevertheless satisfies
τ(α) = σ(α); if there is no such τ then σ is called unambiguous. Motivated
by the recent initial paper on the ambiguity of morphisms, we introduce the
definition of a so-called segmented morphism σn, which, for any n ∈ N, maps
every symbol in an infinite alphabet onto a word that consists of n distinct
factors in ab⁺a, where a and b are different letters. For every n, we consider
the set U(σn) of those finite strings over an infinite alphabet with respect to
which σn is unambiguous, and we comprehensively describe its relation to any
U(σm), m ≠ n.
Thus, our work features the first approach to a characterisation of sets of
strings with respect to which certain fixed morphisms are unambiguous, and
it leads to fairly counter-intuitive insights into the relations between such sets.
Furthermore, it shows that, among the widely used homogeneous morphisms,
most segmented morphisms are optimal in terms of being unambiguous for a
preferably large set of strings. Finally, our paper yields several major improvements
of crucial techniques previously used for research on the ambiguity of
morphisms.
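The definition of ambiguity given in this abstract can be made concrete with a small brute-force sketch. Everything named here (`apply_morphism`, `preimages`, `is_ambiguous`) is hypothetical, and the sketch assumes that the competing morphism τ may map symbols to the empty word; it is an illustration of the definition, not the technique used in the paper. A morphism is represented as a dict from symbols to strings, and α as a sequence of symbols.

```python
def apply_morphism(morphism, alpha):
    """Image of the symbol sequence alpha under the given morphism."""
    return "".join(morphism[x] for x in alpha)


def preimages(alpha, word):
    """Yield every morphism tau (as a dict) with tau(alpha) == word.

    Assumption: tau may erase symbols, i.e. map them to the empty word.
    """
    def extend(i, pos, tau):
        if i == len(alpha):
            if pos == len(word):
                yield dict(tau)
            return
        x = alpha[i]
        if x in tau:
            # Symbol already has an image; it must occur at this position.
            if word.startswith(tau[x], pos):
                yield from extend(i + 1, pos + len(tau[x]), tau)
            return
        # First occurrence of x: try every factor starting at pos.
        for end in range(pos, len(word) + 1):
            tau[x] = word[pos:end]
            yield from extend(i + 1, end, tau)
        del tau[x]

    yield from extend(0, 0, {})


def is_ambiguous(sigma, alpha):
    """True if some tau differs from sigma on a symbol of alpha, yet tau(alpha) == sigma(alpha)."""
    word = apply_morphism(sigma, alpha)
    return any(
        any(tau[x] != sigma[x] for x in tau)
        for tau in preimages(alpha, word)
    )
```

With σ(x) = a and σ(y) = b, the string xyx is ambiguous under this assumption, since τ(x) = ε and τ(y) = aba give the same image aba. By contrast, σ(x) = ab is unambiguous with respect to xx, because abab has only one decomposition into two equal halves. The search is exponential in the worst case and is meant only for small examples.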
Inferring descriptive generalisations of formal languages
In the present paper, we introduce a variant of Gold-style learners that is not
required to infer precise descriptions of the languages in a class, but that must
find descriptive patterns, i.e., optimal generalisations within a class of pattern
languages. Our first main result characterises those indexed families of recursive
languages that can be inferred by such learners, and we demonstrate that
this characterisation shows enlightening connections to Angluin's corresponding
result for exact inference. Furthermore, this result reveals that our model
can be interpreted as an instance of a natural extension of Gold's model of
language identification in the limit. Using a notion of descriptiveness that is
restricted to the natural subclass of terminal-free E-pattern languages, we introduce
a generic inference strategy, and our second main result characterises
those classes of languages that can be generalised by this strategy. This characterisation
demonstrates that there are major classes of languages that can be
generalised in our model, but cannot be inferred by a normal Gold-style learner.
Our corresponding technical considerations lead to insights of intrinsic interest
into combinatorial and algorithmic properties of pattern languages.
Existence and nonexistence of descriptive patterns
In the present paper, we study the existence of descriptive patterns, i.e., patterns that cover
all words in a given set through morphisms and that are optimal in terms of revealing
commonalities of these words. Our main result shows that if patterns may be mapped to
words by arbitrary morphisms, then there exist infinite sets of words that do not have a
descriptive pattern. This answers a question posed by Jiang et al. (Pattern languages with
and without erasing, International Journal of Computer Mathematics 50 (1994)). Since the
problem of whether a pattern is descriptive depends on the inclusion relation of so-called
pattern languages, our technical considerations lead to a number of deep insights into the
inclusion problem for and the topology of the class of terminal-free E-pattern languages.
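What it means for a pattern to cover a set of words through morphisms can be sketched as follows. The sketch assumes terminal-free patterns given as sequences of variable names and morphisms that may erase variables (the E-pattern setting); the helper names `pattern_to_regex` and `covers_all` are hypothetical. It translates a pattern into a Python regular expression with named backreferences and only tests covering. It does not decide whether a covering pattern is descriptive, which additionally depends on the inclusion relation between pattern languages.

```python
import re


def pattern_to_regex(pattern):
    """Compile a terminal-free pattern into a regex with named backreferences.

    The first occurrence of a variable becomes a capture group matching any
    (possibly empty) string; later occurrences refer back to that group.
    """
    names = {}
    parts = []
    for var in pattern:
        if var in names:
            parts.append(f"(?P={names[var]})")
        else:
            names[var] = f"v{len(names)}"
            parts.append(f"(?P<{names[var]}>.*)")
    return re.compile("".join(parts), re.DOTALL)


def covers_all(pattern, words):
    """True if every word is a morphic image of the pattern (erasing allowed)."""
    regex = pattern_to_regex(pattern)
    return all(regex.fullmatch(w) is not None for w in words)
```

For example, the pattern xyx covers the words aba, abcab, and aa (in the last case y is erased), while the pattern xx does not cover aba.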
Bad news on decision problems for patterns
We study the inclusion problem for pattern languages, which
was shown to be undecidable by Jiang et al. (J. Comput. System Sci. 50,
1995). More precisely, Jiang et al. demonstrate that there is no effective
procedure deciding the inclusion for the class of all pattern languages
over all alphabets. Most applications of pattern languages, however, consider
classes over fixed alphabets, and therefore it is practically more
relevant to ask for the existence of alphabet-specific decision procedures.
Our first main result states that, for all but very particular cases, this
version of the inclusion problem is also undecidable. The second main
part of our paper disproves the prevalent conjecture on the inclusion
of so-called similar E-pattern languages, and it explains the devastating
consequences of this result for the intensive previous research on the
most prominent open decision problem for pattern languages, namely
the equivalence problem for general E-pattern languages.
Existence and nonexistence of descriptive patterns
In the present paper, we study the existence of descriptive
patterns, i.e., patterns that cover all words in a given set through morphisms
and that are optimal in terms of revealing commonalities of these
words. Our main result shows that if patterns may be mapped onto words
by arbitrary morphisms, then there exist infinite sets of words that do
not have a descriptive pattern. This answers a question posed by Jiang,
Kinber, Salomaa, Salomaa and Yu (International Journal of Computer
Mathematics 50, 1994). Since the problem of whether a pattern is descriptive
depends on the inclusion relation of so-called pattern languages,
our technical considerations lead to a number of deep insights into the
inclusion problem for and the topology of the class of terminal-free E-pattern
languages.